
Conversation

@yupbank
Member

@yupbank yupbank commented Jun 29, 2018

Since tree algorithms are among the most popular algorithms used in Kaggle competitions,
and we already have a contrib project, tensor_forest, that people like, it would be beneficial to move it into the canned estimators.

cc: @nataliaponomareva

@ewilderj
Contributor

Adding the overview information. This review will remain open for comment until the end of Monday, July 16th (allowing for public holidays).

TensorForest Estimator

Status Proposed
Author(s) Peng Yu (yupbank@gmail.com)
Sponsor Natalia P (Google)
Updated 2018-06-26

Objective

In this doc, we discuss the TensorForest Estimator API, which enables users to create
an Extremely Randomized Forest classifier and regressor. By inheriting from the Estimator class, all of the corresponding interfaces will be supported.
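
Since the estimators inherit from `Estimator`, the standard train/evaluate/predict flow applies. A minimal sketch, assuming the `TensorForestClassifier` proposed here and user-defined `feature_columns` and input functions:

```python
# Hedged sketch: TensorForestClassifier is the estimator proposed in this
# RFC; feature_columns and the input_fn callables are placeholders the
# user would supply, not part of this proposal's text.
classifier = TensorForestClassifier(feature_columns=feature_columns)

classifier.train(input_fn=train_input_fn)               # fit the forest
metrics = classifier.evaluate(input_fn=eval_input_fn)   # e.g., accuracy
predictions = classifier.predict(input_fn=predict_input_fn)
```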

@ewilderj ewilderj added the RFC: Proposed RFC Design Document label Jul 2, 2018
@ewilderj ewilderj changed the title Propose Add tensor forest classifier and regressor to canned estimators Add TensorForest classifier and regressor to canned estimators Jul 2, 2018
@ewilderj ewilderj requested a review from ispirmustafa July 2, 2018 18:23
@martinwicke (Member) left a comment


My main comment would be that we should start with the minimum set of parameters that gives users the flexibility they need. It seems some of the parameters are not that useful; could we remove them to make the API simpler?

Some questions:

  • Could we have benchmarks for this?
  • Could you discuss whether there are efficiencies to be had for whole batch training? We spent a lot of time on such questions for the boosted tree Estimator, and I don't think we need to go into that much detail, but I would like to know whether there are obvious improvements we can make. Sometimes, this type of thing can influence the API (e.g., by requiring a separate pretraining input or something).

* **label_vocabulary:** A list of strings representing possible label values. If given, labels must be of string type and take values in `label_vocabulary`. If it is not given, labels are assumed to be already encoded as integers or floats within [0, 1] for `n_classes=2`, or as integer values in {0, 1, ..., n_classes-1} for `n_classes` > 2. An error will be raised if the vocabulary is not provided and the labels are strings. (A usage sketch follows this list.)
* **n_trees:** The number of trees to create. Defaults to 100. There usually isn't any accuracy gain from using higher values.
* **max_nodes:** Defaults to 10,000. No tree is allowed to grow beyond max_nodes nodes, and training stops when all trees in the forest are this large.
* **num_splits_to_consider:** Defaults to `sqrt(num_features)` capped to be between 10 and 1000. In the extremely randomized tree training algorithm, only this many potential splits are evaluated for each tree node.
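
As a usage sketch for `label_vocabulary` (the constructor calls are hypothetical; only the parameter names come from the list above):

```python
# String labels require a vocabulary so they can be encoded internally.
clf = TensorForestClassifier(
    feature_columns=feature_columns,  # assumed to be defined by the user
    n_classes=3,
    label_vocabulary=['small', 'medium', 'large'],  # hypothetical labels
)

# Integer labels already encoded as {0, 1, 2} need no vocabulary.
clf = TensorForestClassifier(
    feature_columns=feature_columns,
    n_classes=3,
)
```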
Member

Are the 10 and 1000 boundaries universally accepted?

Nit: I would say "clipped"; to my ear, "capped" only works for the upper bound.

Member Author

For now I just borrowed this from the original contrib implementation. It is not universal, though; I'm not sure why the original author implemented it this way.

Member Author

Not really; sklearn's ExtraTrees uses sqrt(num_features) as the default.
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L1192-L1202
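
For reference, the borrowed clipping heuristic amounts to something like this (a sketch of the contrib behavior, not a proposed API; the function name is illustrative):

```python
import math

def default_num_splits_to_consider(num_features):
    # sqrt(num_features), clipped to the [10, 1000] range as in the
    # contrib implementation; sklearn's ExtraTrees uses plain sqrt.
    return max(10, min(1000, int(math.sqrt(num_features))))
```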

Member Author

@yupbank yupbank Jul 8, 2018


So maybe I should remove this clipping heuristic?

* **split_after_samples:** Defaults to 250. In our online version of extremely randomized tree training, we pick a split for a node after it has accumulated this many training samples.
* **bagging_fraction:** If less than 1.0, then each tree sees a different, randomly sampled (without replacement) subset of the training data of size bagging_fraction. Defaults to 1.0 (no bagging), because bagging has not given any accuracy improvement in our experiments so far.
Member

If this gives no improvement, can we remove this argument?

We can always add stuff back, but we can never take it away (except at major versions) so we should be conservative in what we add.

Member Author

For now I just borrowed it from the original contrib implementation, but thanks for your suggestion; I guess I can use the benchmark tool to find out whether the original claim is valid.

Member Author

Since the original paper did have some numbers suggesting bootstrapping is not helping, I'll remove it from the API for now.

@yupbank
Member Author

yupbank commented Jul 4, 2018

For benchmarks, yeah, we might use this: https://www.openml.org/search?q=ExtraTrees&type=flow

> the efficiencies to be had for whole batch training?

In this case, not really: the trees we are using in tensor forest are Hoeffding Trees, which are incremental trees, so we don't require full-batch training.
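
For context, a Hoeffding tree can commit to a split after seeing only a bounded number of samples, because the Hoeffding bound limits how far a sample mean can drift from the true mean. A standard formulation as a sketch (general background, not from this thread; the proposal above uses the simpler fixed `split_after_samples` threshold instead):

```python
import math

def hoeffding_bound(value_range, delta, n):
    # With probability 1 - delta, the true mean of a variable with range
    # `value_range` lies within this epsilon of its mean over n samples.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))
```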

@yupbank yupbank force-pushed the propose-tensor-forest-estimator branch from 874907c to 16d1708 Compare July 8, 2018 02:33
@yupbank yupbank force-pushed the propose-tensor-forest-estimator branch from 500ecaa to 58d0bf8 Compare July 21, 2018 01:48
@ewilderj ewilderj changed the title Add TensorForest classifier and regressor to canned estimators RFC: Add TensorForest classifier and regressor to canned estimators Aug 1, 2018
@yupbank yupbank force-pushed the propose-tensor-forest-estimator branch from e84b8d9 to fc56dbc Compare August 9, 2018 01:08
@ewilderj ewilderj added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Aug 9, 2018
@ewilderj
Contributor

ewilderj commented Aug 9, 2018

@nataliaponomareva @martinwicke I think we're good to merge this now. Waiting for your LGTM and I'll merge.

@martinwicke
Member

Can we reflect the discussion notes somewhere here? It could be a link to a doc, or even in the comment thread. I just don't want them lost. @tanzhenyu

@ewilderj
Contributor

ewilderj commented Aug 9, 2018

Agreed; previously we've linked them at the bottom of the RFC. Either that, or including them in this PR thread would also work, and we'll link the PR discussion at the bottom of the RFC.

@tanzhenyu
Contributor

Talked with Edd offline; he will post it within the ready-to-push RFC.

@ewilderj
Contributor

ewilderj commented Aug 9, 2018

Notes from the review committee meeting on 2018-08-07:

  • In core: agreed.
  • Batch prediction: extra effort currently done for inference, using TensorFrame.
  • Reusing ops from boosted trees: hard, as they are completely different. Could share the same proto, and the prediction & tree-growing ops; will look deeper at what can be done in common.
  • Interface: take a binary/multi-class/regressor head. The loss is different, so the loss in the head is not used. The head will define the metrics we need. Question: can any head work with TFForestEstimator? Answer: it should. Should be able to detect which metrics/predictions/losses are neglected by the estimator. Will do that in the constructor.
  • Number of classes: will support multi-class in the first version.
  • Random seed: it is in RunConfig; use that instead. When we're randomly sampling we will need the seed, and for shuffling and batching as well; currently supported.
  • Feature columns: dense numerical columns to start. Will add sparse columns next; just supporting the current use case for now.
  • n_trees, max_nodes: the maximum memory size is 2G and the batch must fit in memory; we could potentially split it out if the size of the trees and the proto limit cause OOM.
  • Use None rather than 0 to represent no seed: yes.
  • label_dimension: the dimension of the output label.
  • Additional features: sample weights as the start, sparse features next. Should we ask other contributors? Yu Peng will do it.
  • How to get this done for external contributors? Add the API review label. We could do prediction in a separate PR. Training can be split into local training and distributed training, but the graph should be the same for both.


- Simplified code with only a limited subset of features (obviously excluding all the experimental ones)
- New estimator interface, support for new feature columns and losses


Can you also copy this over from the doc:

> We will try to reuse as much code from canned boosted trees as possible (proto, inference, etc.)

@yupbank yupbank force-pushed the propose-tensor-forest-estimator branch from c363771 to 424fe65 Compare August 9, 2018 18:38
### Interface
### TensorForestClassifier

```
Contributor

use ```python to get the syntax highlighting?
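
The code block above appears truncated in this view. As a rough sketch only, the classifier's constructor assembled from the parameters discussed in this thread might look like the following; everything beyond those bullets (the base class, `model_dir`, `config`, and the exact defaults) is an assumption:

```python
import tensorflow as tf

class TensorForestClassifier(tf.estimator.Estimator):
    """Sketch only; the body and exact defaults are assumptions."""

    def __init__(self,
                 feature_columns,
                 model_dir=None,
                 n_classes=2,
                 label_vocabulary=None,
                 n_trees=100,
                 max_nodes=10000,
                 num_splits_to_consider=None,  # None -> sqrt(num_features)
                 split_after_samples=250,
                 config=None):
        ...
```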


### TensorForestRegressor

```
Contributor

use ```python to get the syntax highlighting?
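
And analogously for the regressor, with `label_dimension` (mentioned in the meeting notes above) in place of the classification-only arguments; again a hedged sketch, not the RFC's actual code:

```python
import tensorflow as tf

class TensorForestRegressor(tf.estimator.Estimator):
    """Sketch only; the body and exact defaults are assumptions."""

    def __init__(self,
                 feature_columns,
                 model_dir=None,
                 label_dimension=1,
                 n_trees=100,
                 max_nodes=10000,
                 num_splits_to_consider=None,  # None -> sqrt(num_features)
                 split_after_samples=250,
                 config=None):
        ...
```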

4. Otherwise, `(x_i, y_i)` is used to update the statistics of every split in the growing statistics of leaf `l_i`. If leaf `l_i` has now seen `split_after_samples` data points since its potential splits were created, the split with the best score is chosen, and the tree structure is grown.
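
In pseudocode, step 4 is roughly the following (illustrative names only; `score()` stands for whatever split criterion the implementation uses):

```python
def update_leaf(leaf, x_i, y_i, split_after_samples):
    # Every candidate split of the leaf accumulates statistics for (x_i, y_i),
    # e.g., label counts on each side of the split.
    for split in leaf.candidate_splits:
        split.update_statistics(x_i, y_i)
    leaf.num_samples_seen += 1
    # Once enough samples have accumulated, pick the best-scoring split
    # and grow the tree: the leaf becomes an internal node with children.
    if leaf.num_samples_seen >= split_after_samples:
        best = max(leaf.candidate_splits, key=lambda s: s.score())
        leaf.split(best)
```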


## BenchMark

Nit: "Benchmark".

| Dataset | Instances | Features | Classes | scikit-learn accuracy (%) | TensorForest accuracy (%) |
|---|---|---|---|---|---|
| Covertype | 581k | 54 | 7 | 83.0 | 85.0 |
| HIGGS | 11M | 28 | 2 | 70.9 | 71.7 |

With single-machine training, TensorForest finishes much faster on big datasets like HIGGS, taking about one percent of the time scikit-learn required.


Hm, where is time in this table? It is just performance metrics, right?

Member Author

Yeah, it's just performance metrics; I took it from the workshop paper.


But what I am saying is that this statement "much faster on big datasets" is not substantiated by this table. Either keep the table and say that it is from resource A, demonstrating that the quality is on par with scikit-learn, and remove the statement that says it trains faster; or add a reference to a resource which states that it trains faster.

Member Author

I see. I'll add a citation to the workshop paper, as the claim was also from the paper.

@ewilderj ewilderj merged commit 61fca22 into tensorflow:master Aug 10, 2018
@yupbank yupbank deleted the propose-tensor-forest-estimator branch August 10, 2018 18:02
theadactyl pushed a commit that referenced this pull request Oct 3, 2019
Update TFX notebook RFC after comments
ematejska pushed a commit that referenced this pull request Apr 15, 2020
Update 20191016-dlpack-support.md
omalleyt12 pushed a commit to omalleyt12/community that referenced this pull request Nov 30, 2020
Callback changes based on discussion
@Karenou

Karenou commented Feb 24, 2021

Hi, may I know whether the tensor forest package is still supported under TensorFlow 2.0.x? Many thanks.

ematejska pushed a commit that referenced this pull request Mar 4, 2022